Exploring CAGE data in R

Leonie Roos and Nevena Cvetesic
11. 01. 2017.

Overview


  • Data formats for downstream manipulation
  • Annotation for genomic features
  • Promoter classification
  • Oligonucleotide heatmaps

What can we get out of a CAGEset?


Data frames and matrices returned by CAGEr functions:

  • Per individual sample (raw or normalized tag counts)

    • CTSS
    • Tag clusters & interquantile widths
    • Consensus clusters & interquantile widths

  • Across samples (one output for multiple samples)

    • Consensus clusters & interquantile widths

These can be easily manipulated by the user in R.

We'll go through this in more detail in the tutorial
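As a minimal sketch of that kind of manipulation, the base R snippet below filters, sorts, and summarises a mock tag cluster table. The column names (chr, start, end, strand, tpm, interquantile_width) and all values are assumptions for illustration and are not guaranteed to match what a given CAGEr version returns.

```r
# Mock tag cluster table; column names and values are illustrative assumptions
tc <- data.frame(
  chr = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 500, 200, 900),
  end = c(104, 560, 203, 990),
  strand = c("+", "-", "+", "+"),
  tpm = c(12.5, 3.2, 40.1, 0.8),
  interquantile_width = c(3, 45, 2, 70),
  stringsAsFactors = FALSE
)

# Keep clusters with at least 1 TPM
tc_expressed <- tc[tc$tpm >= 1, ]

# Order by expression, highest first
tc_expressed <- tc_expressed[order(tc_expressed$tpm, decreasing = TRUE), ]

# Median interquantile width per chromosome
median_width <- tapply(tc_expressed$interquantile_width, tc_expressed$chr, median)
```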

CAGE data per sample


Apart from consensus clusters, most data are extracted per sample because they are sample-specific:

  • genomic coordinates of CTSS
  • genomic coordinates of Tag clusters


You will likely want to perform the same analyses, checks, and plots for all samples.

Solution?

Automate what you're doing! Functions are a great way to avoid mistakes.

  • For example, create a list that holds the information for all samples, with one slot per sample

We'll create a few of these in the tutorial
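A minimal sketch of this pattern in base R is shown here: a named list with one slot per sample, and a single function applied to every slot with lapply. The sample names, column names, values, and the helper summarise_tcs are all hypothetical.

```r
# Mock per-sample tag cluster tables; sample names and columns are assumptions
tc_per_sample <- list(
  sample_A = data.frame(tpm = c(5, 20, 1), interquantile_width = c(3, 40, 8)),
  sample_B = data.frame(tpm = c(2, 9), interquantile_width = c(15, 4))
)

# One function holds the per-sample analysis, so every sample is treated identically
summarise_tcs <- function(tc) {
  data.frame(
    n_clusters = nrow(tc),
    total_tpm = sum(tc$tpm),
    median_width = median(tc$interquantile_width)
  )
}

# Apply the function to all samples and bind the results into one table
per_sample <- lapply(tc_per_sample, summarise_tcs)
summary_table <- do.call(rbind, per_sample)
```

Because the list is named, the row names of summary_table carry the sample names, which makes it harder to mix samples up later.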

CAGE data - Genomic Features


Where does most of the signal fall?


Nepal, C., et al. (2013). Dynamic regulation of coding and non-coding transcription initiation landscape at single nucleotide resolution during vertebrate embryogenesis. Genome Research, 23(11):1938-1950.

CAGE data - Genomic Features


We overlap the TCs of each sample with each genomic feature and count the occurrences per sample:

  • promoter
  • 5 kb upstream of promoter
  • exons
  • introns
  • gene
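The counting step can be sketched in base R as follows. This is a toy version with made-up coordinates and a hypothetical helper, overlaps_feature. With real annotation, interval classes such as those in the GenomicRanges package are the more usual tool.

```r
# Mock feature annotation; coordinates and labels are illustrative assumptions
features <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr1"),
  start = c(1000, 1501, 2001, 1),
  end = c(1500, 2000, 3000, 999),
  feature = c("promoter", "exon", "intron", "upstream"),
  stringsAsFactors = FALSE
)

# Mock tag clusters (TCs) for one sample
tcs <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2"),
  start = c(1200, 1490, 2500, 1200),
  end = c(1210, 1510, 2520, 1210),
  stringsAsFactors = FALSE
)

# TRUE for each TC that overlaps at least one region of the given feature type
overlaps_feature <- function(tcs, regions) {
  vapply(seq_len(nrow(tcs)), function(i) {
    any(regions$chr == tcs$chr[i] &
        regions$start <= tcs$end[i] &
        regions$end >= tcs$start[i])
  }, logical(1))
}

# Number of TCs overlapping each feature type
feature_counts <- vapply(
  split(features, features$feature),
  function(regions) sum(overlaps_feature(tcs, regions)),
  integer(1)
)
```

In this sketch a TC that spans a boundary is counted for every feature it touches. If each TC should be assigned to exactly one feature, a priority order between features has to be chosen first.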

Promoter Width


Promoter interquantile widths

  • determined earlier in CAGEr
  • set between quantiles 0.1 and 0.9 (q0.1 to q0.9)


Use ggplot2 to create types of graphs other than those CAGEr offers.
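As a rough sketch, a per-sample histogram of interquantile widths could be drawn with ggplot2 like this. The data are simulated, and the column and sample names are assumptions, not CAGEr output.

```r
library(ggplot2)

# Mock interquantile widths per sample; values are simulated for illustration only
set.seed(1)
widths <- data.frame(
  sample = rep(c("sample_A", "sample_B"), each = 200),
  interquantile_width = c(rpois(200, 8), rpois(200, 25)) + 1
)

# One histogram panel per sample, stacked vertically for easy comparison
p <- ggplot(widths, aes(x = interquantile_width, fill = sample)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ sample, ncol = 1) +
  labs(x = "Interquantile width (bp)", y = "Number of tag clusters") +
  theme_bw()

# print(p) draws the plot on the active graphics device
```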

Sequence Features of Core Promoters


Distinguishing features of sharp and broad promoters

  • Genomic position of dominant TSS
  • Add a window (200 bp for example) around it
  • Plot the average oligonucleotide profiles

Based on interquantile widths: sharp < 10 bp & broad >= 10 bp

Taken from R package: seqPattern.
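The classification and windowing steps can be sketched in base R as below. The table and its column names are hypothetical, and splitting the 200 bp window into 100 bp on each side of the dominant TSS is an assumption.

```r
# Mock tag clusters; column names and values are illustrative assumptions
tc <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  dominant_ctss = c(1500, 8200, 430),
  strand = c("+", "-", "+"),
  interquantile_width = c(4, 10, 37),
  stringsAsFactors = FALSE
)

# Threshold from the slide: sharp < 10, broad >= 10
tc$shape <- ifelse(tc$interquantile_width < 10, "sharp", "broad")

# Window around the dominant TSS: 100 bp on either side (an assumed split)
flank <- 100
tc$window_start <- tc$dominant_ctss - flank
tc$window_end <- tc$dominant_ctss + flank
```

With real data, windows that run past a chromosome end need to be dropped or clipped, and sequences on the minus strand need to be reverse-complemented before profiles are averaged.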

Heatmaps


Another great visualization tool is density heatmaps:

These plot the density of oligonucleotides such as TA, CG, WW, SS

The striped line indicates the dominant CTSS.

  • 400 bp upstream and downstream
  • Sequences are ordered by interquantile width of Tag cluster

Produced by heatmaps R package
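To make the idea concrete, the base R sketch below builds the kind of occurrence matrix that underlies such a plot: one row per sequence, one column per position, with 1 wherever the dinucleotide starts. The sequences, widths, and the helper dinucleotide_matrix are made up for illustration. This is not the heatmaps package API.

```r
# Toy sequences assumed to be centred on the dominant CTSS; all values are made up
seqs <- c(tc1 = "GCTATAAAGGCG", tc2 = "CGCGGCGCCGCG", tc3 = "TTATAGCGTACC")
interquantile_width <- c(tc1 = 5, tc2 = 60, tc3 = 12)

# Binary matrix: rows are sequences, columns are positions,
# 1 where the dinucleotide starts at that position
dinucleotide_matrix <- function(seqs, pattern) {
  n_bp <- nchar(seqs[[1]])
  starts <- seq_len(n_bp - 1)
  t(vapply(seqs, function(s) {
    as.integer(substring(s, starts, starts + 1) == pattern)
  }, integer(n_bp - 1)))
}

ta <- dinucleotide_matrix(seqs, "TA")

# Order rows by interquantile width, sharpest first
ta_ordered <- ta[order(interquantile_width), , drop = FALSE]
```

A density heatmap is then, roughly, a smoothed image of a matrix like ta_ordered, which is why the row order decides which patterns become visible.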

Heatmaps - Order is important

The order of sequences is important when looking for architectures:


Similar sample but ordered differently

Tutorial


The tutorial that is linked with this presentation:

Tutorials dir

CAGE-workshop-Tutorial-P3-ExploringCageInR